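from sklearn.preprocessing import LabelEncoder

# Label-encode every string-valued feature and keep the fitted encoders
# so the integer codes can be mapped back to the original categories.
df_encoded = df.copy()
label_encoders = {}
for col in features:
    if df_encoded[col].dtype == "object":
        le = LabelEncoder()
        df_encoded[col] = le.fit_transform(df_encoded[col])
        label_encoders[col] = le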
from sklearn.model_selection import train_test_split

X = df_encoded[features]
y = df_encoded[target]
# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
import plotly.express as px

# `rf` is the random forest fitted on the structured features;
# the cell that trains it is not shown in this section.
fig = px.bar(
    x=rf.feature_importances_,
    y=features,
    orientation='h',
    labels={'x': 'Importance', 'y': 'Feature'},
    title='Feature Importance – ML Role Classification'
)
fig.update_layout(
    yaxis=dict(categoryorder='total ascending'),
    margin=dict(l=100, r=20, t=50, b=20),
    height=500,
    template='plotly_white'
)
fig.write_html('figures/rm_model_plot1.html', include_plotlyjs='cdn', full_html=False)
This bar chart displays the feature importance scores from a random forest model predicting whether a job role involves ML/Data Science. The most influential feature by far is the job title (TITLE), whose importance is substantially higher than that of every other variable. Secondary contributors include industry classification (NAICS2_NAME) and minimum years of experience, whereas education level and SOC code had relatively low influence on the model’s predictions. This suggests that the job title alone carries strong predictive power for identifying ML-related roles.
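Impurity-based importances from a random forest can favor features with many distinct values, which a label-encoded TITLE column is likely to be. One possible cross-check is permutation importance on held-out data. The sketch below is not part of the original analysis: it runs on synthetic data, and `demo_rf`, `demo_features`, and the column names are hypothetical stand-ins for the report's `rf` and `features`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): column 0 drives the label,
# columns 1 and 2 are pure noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_features = ["signal", "noise_a", "noise_b"]  # hypothetical names

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)
demo_rf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_rf.fit(Xd_train, yd_train)

# Shuffle one column at a time on the held-out rows and record
# how much the model's accuracy drops on average.
perm = permutation_importance(
    demo_rf, Xd_test, yd_test, n_repeats=5, random_state=0
)

for name, impurity, permuted in zip(
    demo_features, demo_rf.feature_importances_, perm.importances_mean
):
    print(f"{name:8s} impurity={impurity:.3f} permutation={permuted:.3f}")
```

If the two rankings agree on the real data, the dominance of TITLE is less likely to be an artifact of how it was encoded.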
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned job descriptions
df['BODY_clean'] = df['BODY'].fillna("").str.lower()

# Target
y = df['REQUIRES_ML']  # this should be a binary 1/0 column

# TF-IDF vectorization
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf.fit_transform(df['BODY_clean'])
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Stratify so both classes keep their proportions in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
import numpy as np

# Pick the 20 TF-IDF terms with the highest importance in the text model.
importances = model.feature_importances_
top_idx = np.argsort(importances)[-20:]
top_words = tfidf.get_feature_names_out()[top_idx]
top_importances = importances[top_idx]

fig = px.bar(
    x=top_importances,
    y=top_words,
    orientation='h',
    labels={'x': 'Importance', 'y': 'Word'},
    title='Top 20 TF-IDF Words for ML Role Classification'
)
fig.update_layout(
    yaxis={'categoryorder': 'total ascending'},
    margin=dict(l=120, r=20, t=60, b=20),
    width=800,
    height=600,
    template='plotly_white'
)
fig.write_html('figures/rm_model_plot2.html', include_plotlyjs='cdn', full_html=False)
This bar chart shows the top words contributing to the classification of job roles as Machine Learning (ML)-related, based on job description text. Surprisingly, the most influential words are “attention,” “chain,” and “supply,” which could indicate overlap with supply chain roles or reflect noise in the model. More expected terms such as “machine,” “learning,” “python,” “AI,” and “analytics” also appear, reinforcing that relevant technical language still plays a role in identifying ML-related positions. The presence of general words like “strong” or “communication” suggests that not all influential terms are strictly technical.
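One way to probe whether a word like “supply” reflects real overlap or noise is to compare how often it appears in postings of each class. The helper below is a hypothetical sketch (`term_rate_by_class` and the toy postings are not from the original analysis); on the real data it would be called with `df['BODY_clean']` and `df['REQUIRES_ML']`.

```python
import re
from collections import Counter


def term_rate_by_class(texts, labels, term):
    """Return {label: share of documents in that class containing `term`}."""
    totals = Counter(labels)
    hits = Counter()
    # Match the term as a whole word, case-insensitively.
    pattern = re.compile(rf"\b{re.escape(term.lower())}\b")
    for text, label in zip(texts, labels):
        if pattern.search(text.lower()):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Toy postings (assumption), labeled 1 when the role requires ML.
toy_texts = [
    "supply chain analyst with strong attention to detail",
    "machine learning engineer, python required",
    "supply chain planner using machine learning forecasts",
    "warehouse associate",
]
toy_labels = [0, 1, 1, 0]

for term in ["supply", "python"]:
    print(term, term_rate_by_class(toy_texts, toy_labels, term))
```

A term whose rate is similar in both classes is unlikely to be a clean signal on its own, even if the forest assigns it high importance through interactions with other words.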
from sklearn.metrics import confusion_matrix
import plotly.figure_factory as ff

# Rows of `cm` are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
labels = [str(c) for c in model.classes_]

fig = ff.create_annotated_heatmap(
    z=cm,
    x=labels,
    y=labels,
    colorscale='Blues',
    showscale=True,
    annotation_text=cm,
    hoverinfo='z'
)
fig.update_layout(
    title='Confusion Matrix – ML Role Classification',
    xaxis_title='Predicted Label',
    yaxis_title='Actual Label',
    xaxis=dict(tickmode='array', tickvals=list(range(len(labels))), ticktext=labels),
    yaxis=dict(tickmode='array', tickvals=list(range(len(labels))), ticktext=labels),
    width=700,
    height=600,
    template='plotly_white',
    margin=dict(l=80, r=20, t=60, b=80)
)
fig.write_html("figures/rm_model_plot3.html", include_plotlyjs='cdn', full_html=False)
We selected a combination of structured and unstructured features to predict whether a job role requires Machine Learning or Data Science. Structured features such as TITLE, SOC_2021_4_NAME, NAICS2_NAME, MIN_EDULEVELS_NAME, and MIN_YEARS_EXPERIENCE were chosen for their domain relevance: these fields reflect the role’s function, industry, required education, and experience level, all of which can signal ML-related requirements. Additionally, we included the job description BODY text, applying TF-IDF vectorization to extract key terms. This allowed the model to learn from nuanced language patterns within postings. Feature importance and performance metrics indicate that both structured metadata and text data contribute meaningfully to classification accuracy.
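The code in this section fits the structured and text models separately. If a single model over both kinds of features were wanted, one option is scikit-learn's `ColumnTransformer`. The sketch below is an assumption rather than the method used here: the toy rows are invented, only a subset of the columns named above is included, and one-hot encoding replaces the label encoding applied earlier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy rows (assumption) reusing column names from the write-up.
toy = pd.DataFrame({
    "TITLE": ["Data Scientist", "Supply Chain Analyst", "ML Engineer", "Accountant"],
    "NAICS2_NAME": ["Information", "Manufacturing", "Information", "Finance and Insurance"],
    "MIN_YEARS_EXPERIENCE": [3, 2, 5, 1],
    "BODY": [
        "build machine learning models in python",
        "manage supply chain logistics and vendors",
        "deploy deep learning pipelines to production",
        "prepare monthly ledgers and tax filings",
    ],
    "REQUIRES_ML": [1, 0, 1, 0],
})

preprocess = ColumnTransformer([
    # One-hot encode categorical fields; unseen categories are ignored at predict time.
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["TITLE", "NAICS2_NAME"]),
    # Numeric fields pass through unchanged.
    ("numeric", "passthrough", ["MIN_YEARS_EXPERIENCE"]),
    # A text column is named with a plain string so the vectorizer receives 1-D input.
    ("text", TfidfVectorizer(stop_words="english"), "BODY"),
])

combined = Pipeline([
    ("features", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=42)),
])

toy_X = toy.drop(columns="REQUIRES_ML")
combined.fit(toy_X, toy["REQUIRES_ML"])
toy_pred = combined.predict(toy_X)
print(toy_pred)
```

Wrapping the preprocessing in a pipeline also means the encoders and the TF-IDF vocabulary are fitted only on whatever data is passed to `fit`, which helps avoid leaking test-set information when the pipeline is used with a train/test split.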